Introduction

On a daily basis, we produce and encounter huge amounts of text data, spoken or written, in many languages. However, the only language computers understand is numbers, so to work with text efficiently we need to teach computers to interpret spoken and written words. This can be achieved through Natural Language Processing (NLP). NLP gives computers the ability to understand written text and spoken words in much the same way human beings can: it enables them to process human language in the form of text or voice data and to ‘understand’ its full meaning, including the speaker’s or writer’s intent and sentiment.

Aims

  1. To understand the basic relationships in the given data in order to interpret it more efficiently.
  2. To explore each data file (en_US.blogs, en_US.news and en_US.twitter) for file size and the number of characters, words and lines.
  3. To take samples of each file and combine them for further analysis.
  4. To examine the word distribution in each file using tables, histograms and word clouds.
  5. To build \(N\)-grams of words to show relationships between words in the sample dataset.
  6. To summarize word distributions and \(N\)-gram relationships with histograms and word clouds.
  7. To build a backoff predictive model for next-word prediction.

Useful Packages

library(tidytext, warn.conflicts = FALSE)           # tokenization (unnest_tokens)
library(tidyverse, warn.conflicts = FALSE)          # data wrangling and ggplot2
library(stringi, warn.conflicts = FALSE)            # fast string statistics
library(plotly, warn.conflicts = FALSE)             # interactive plots
library(qdapRegex, warn.conflicts = FALSE)          # rm_white() for extra spaces
library(wordcloud, warn.conflicts = FALSE)          # word clouds
library(RColorBrewer, warn.conflicts = FALSE)       # color palettes
library(syuzhet, warn.conflicts = FALSE)            # sentiment extraction
library(SentimentAnalysis, warn.conflicts = FALSE)  # sentiment analysis
library(sentimentr, warn.conflicts = FALSE)         # sentence-level sentiment
library(data.table, warn.conflicts = FALSE)         # fast data manipulation

Basic Data Exploration

Files Sizes

The sizes in megabytes (MB) of en_US.blogs, en_US.news and en_US.twitter are shown below.

             Size (MB)
Blogs File    200.4242
News File     196.2775
Twitter File  159.3641

Data Import

The three data files will be imported and samples taken for further analysis. We will remove profane words by filtering our data against the words in the profanity_txt file.

  1. en_us.blogs
  2. en_us.news
  3. en_us.twitter
  4. Profane words
setwd("C:/Users/justi/Documents/Olu_Drive/Coursera/Data_Science_Statistics_and_Machine_Learning_Specialization/Capstone/en_US")
blogs_txt <- readLines("en_US.blogs.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
news_txt <- readLines("en_US.news.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
twitter_txt <- readLines("en_US.twitter.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
profanity_txt <- readLines("profanity.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
profanity_df <- tibble(profanity_txt)
special_txt <- readLines("special.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
special_df <- tibble(special_txt)

Brief Data Summary

Here we determine the basic features of en_US.blogs, en_US.news and en_US.twitter: how many characters (letters, spaces and other symbols), words and lines are in each file? Together with the file sizes, these are presented in the table below.

File Type  File Size (MB)  Number of Characters  Number of Words  Number of Lines
Blogs              200.42             206824505         37546250           899288
News               196.28              15639408          2674536            77259
Twitter            159.36             162096241         30093413          2360148
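The counts above can be computed with the `stringi` helpers; a minimal sketch, using a toy character vector in place of the full `blogs_txt` (the raw files are not bundled here, so the vector is an assumption):

```r
library(stringi)

# Toy stand-in for blogs_txt; in the report this comes from readLines()
toy_txt <- c("Hello world", "Another line of text")

n_lines <- stri_stats_general(toy_txt)["Lines"]  # number of lines
n_chars <- stri_stats_general(toy_txt)["Chars"]  # number of characters
n_words <- sum(stri_count_words(toy_txt))        # total word count
```

The file size in MB would come from, e.g., `file.size("en_US.blogs.txt") / 1024^2`.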

Sampling

I will sample 0.5% of each of blogs_txt, news_txt and twitter_txt and combine them into a single set, sample1_txt. The first 3 lines of sample1_txt are shown below.

[1] "Or put another way – in the spirit of this site’s mission – it’s all bollocks."
[2] "No Regrets for Our Youth – 0"                                                  
[3] "Tom: See you!"                                                                 
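The sampling step itself is not shown above; one way to sketch it is below. The 0.5% fraction comes from the text, while the seed value and the toy input vectors are assumptions so the sketch runs stand-alone:

```r
set.seed(2024)       # seed chosen arbitrarily, for reproducibility
sample_pct <- 0.005  # 0.5% of each source

# Toy stand-ins for blogs_txt, news_txt and twitter_txt from the import step
blogs_txt   <- sprintf("blog line %d", 1:1000)
news_txt    <- sprintf("news line %d", 1:1000)
twitter_txt <- sprintf("tweet %d", 1:1000)

# Draw the same fraction from each source, then combine into one set
take <- function(x, p) sample(x, ceiling(length(x) * p))
sample1_txt <- c(take(blogs_txt, sample_pct),
                 take(news_txt, sample_pct),
                 take(twitter_txt, sample_pct))
```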

Data Preprocessing

First, we need to clean the data and remove irrelevant characters so we can concentrate on the important words from this file.

  1. Remove lines containing non-ASCII (latin1) characters.
latin1ASCII_func <- grep("latin1ASCII", iconv(sample1_txt, "latin1", "ASCII", sub = "latin1ASCII"))
sample2_txt <- sample1_txt[-latin1ASCII_func]
  2. Remove special characters, digits and extra white space.


sample3_txt <- gsub("&amp", " ", sample2_txt)                           # remove HTML-escaped ampersands
sample3_txt <- gsub("RT :|@[a-z,A-Z]*: ", " ", sample3_txt)             # remove retweet markers
sample3_txt <- gsub("@\\w+", " ", sample3_txt)                          # remove @mentions
sample3_txt <- gsub("[[:digit:]]", " ", sample3_txt)                    # remove digits
sample3_txt <- gsub(" #\\S*", " ", sample3_txt)                         # remove hashtags
sample3_txt <- gsub(" ?(f|ht)tp(s?)://(.*)[.][a-z]+", " ", sample3_txt) # remove URLs
sample3_txt <- gsub("[^[:alnum:][:space:]']", "", sample3_txt)          # remove punctuation except apostrophes
sample3_txt <- rm_white(sample3_txt)                                    # remove extra spaces (`qdapRegex`)

See below the first 3 lines of the clean set.

[1] "Tom See you"                                                                                                                                           
[2] "See it's all the fault of evolution"                                                                                                                   
[3] "But seriously Wells Youngs WHAT IS THIS BULL CRAP ABOUT NOT SELLING IT IN THE UK UNTIL NEXT YEAR Get it sorted I want to be drinking this at Christmas"

Tokenization

A token is a meaningful unit of text, most often a word, that we are interested in using for further analysis, and tokenization is the process of splitting text into tokens.

Create one-token-per-document-per-row

We need to both break the text into individual tokens and transform it into a tidy data structure. This is equivalent to a unigram (\(1\)-gram). We also need to filter profane words out of the text corpus.

unigram <- sample_df %>%
  unnest_tokens(word, text) %>%                       # one token (word) per row
  filter(!word %in% profanity_df$profanity_txt) %>%   # remove profane words
  filter(!word %in% special_df$special_txt) %>%       # remove special words
  drop_na()

Find most frequent words

The count() function will be useful here, helping us visualize the dataset. The five most frequent words and their frequencies \(n\) are shown below.

unigram <- unigram %>% 
  count(word, sort = TRUE) %>% 
  mutate(word = reorder(word, n)) %>% 
  filter(n > 10)
head(unigram, 5)
# A tibble: 5 x 2
  word      n
  <fct> <int>
1 the   10521
2 to     7141
3 i      5938
4 a      5843
5 and    5626

Data Visualization

Create histogram

We use ggplot to generate the histogram and line graph below, showing the words that occur more than 800 times.
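A sketch of how the histogram could be built with ggplot, assuming the `unigram` counts from the previous step (a small toy data frame is used here so the example runs on its own):

```r
library(ggplot2)

# Toy stand-in for the real unigram word counts
unigram <- data.frame(word = c("the", "to", "i", "a", "and"),
                      n    = c(10521, 7141, 5938, 5843, 5626))

p <- ggplot(subset(unigram, n > 800), aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +  # horizontal bars keep word labels readable
  labs(x = NULL, y = "Frequency (n)",
       title = "Words occurring more than 800 times")
```

The same data could feed a line graph by swapping `geom_col()` for `geom_line(group = 1)`.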

Line graph